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Abstract 

Lattice spin models are useful for studying critical phenomena and allow the extraction of equi- 
librium and dynamical properties. Simulations of such systems are usually based on Monte Carlo 
(MC) techniques and the main difficulty often being the large computational effort needed when 
approaching critical points. In this work it is shown how such simulations can be accelerated with 
the use of NVIDIA Graphics Processing Units (GPUs) using the CUDA programming architecture. 
We have developed two different algorithms for lattice spin models, the first useful for equilibrium 
properties near a second order phase transition point and the second for dynamical slowing down 
near a glass transition. The algorithms are based on parallel MC techniques and speedups from 70 
to 150 fold over conventional single-threaded computer codes are obtained using consumer-grade 
hardware. 
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I. INTRODUCTION 



In most common cases, computer programs are written serially: to solve a problem, an 
algorithm is constructed and implemented as a serial stream of coded instructions. These 
instructions are executed on a Central Processing Unit (CPU) on one computer. Momen- 
tarily disregarding specific cases for which modern CPU hardware is optimized, only one 
instruction may be executed at a time; on its termination, the next instruction begins its 
execution. On the other hand, parallel computing involves the simultaneous use of multiple 
computational resources. In recent years, massively parallel computing has become a valu- 
able tool, mainly due to the rapid development of the Graphics Processing Unit (GPU), a 
highly parallel, multi-threaded, many core processor. However, due to the specialized nature 
of the hardware, the technical difficulties involved in GPU programming for scientific use 
stymied progress in the direction of using GPUs as general-purpose parallel computing de- 
vices. This situation has begun to change more recently, as NVIDIA introduced CUDA™, 
a general purpose parallel computing architecture with a new parallel programming model 
and instruction set architecture (similar technology is also available from ATI but was not 
studied by us). CUDA comes with a software environment that allows developers to use C 
(C++, CUDA FORTRAN, OpenCL, and DirectCompute are now also supported^) for pro- 
gram development, exposing an application programming interface and language extensions 
that allow access to the GPU without specialized assembler or graphics-oriented interfaces. 
Nowadays, there are quite a few scientific applications that are running on GPUs; this 
includes quantum chemistry applications^, quantum Monte Carlo simulations^, molecu- 
lar dynamics^^, hydrodynamics^, classical Monte Carlo simulations^^, and stochastic 
processes^^oi. 

In this work we revisit the problem of parallel Monte Carlo simulations of lattice spin 
models on GPUs. Two generic model systems were considered: 

1. The two-dimensional (2D) Ising modeP^ serves as a prototype model for describing 
critical phenomena and equilibrium phase transitions. Numerical analysis of critical 
phenomena is based on computer simulations combined with finite size scaling tech- 
niques 22 . Near the critical point, these simulations require large-scale computations 
and long averaging, while current GPU algorithms are limited to small lattice sizes^ 
or to spin | systems^. Thus, in three-dimensions (3D), the solution of lattice spin 
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models still remains a challenge in statistical physics. 

2. The North-East modeP^ serves as a prototype for describing the slowing down of glassy 

systems. The computational challenge here is mainly to describe correlations and 

fluctuations on very long timescales, as the system approaches the glass temperature 

(concentration)^. As far as we know, simulations of facilitated models on GPUs 

have not been explored so far, but their applications on CPU has been discussed 
extensivelj^HS] 

We develop two new algorithms useful to simulate these lattice spin models. For certain 
conditions, we obtain speedups of two orders of magnitude in comparison with serial CPU 
simulations. We would like to note that in many GPU applications, speedups are reported 
in Gflops, while in the present work, we report speedups of the actual running time of the 
full simulation. The latter provides a realistic estimate of the performance of the algorithms. 
The paper is organized as follows: Section III] comprises a brief introduction to GPUs and 



summarizes their main features. Section III contains a short overview of lattice spin models 



and the algorithms developed. Section IV includes a short discussion of the random number 



generator used in this work. Section [V] presents the results and the comparisons between 



CPU and GPU performance. Finally, Section VI concludes. 



II. GPU ARCHITECTURE 



In order to understand GPU programming we provide a sketch of NVIDIA's GPU device 
architecture. This is important for the development of the algorithms reported below, and 
in particular for understanding the logic and limitations behind our approach. By now 
three generations of GPUs have been released by NVIDIA (G80, GT200 and the latest 
architecture codename "Fermi"). In table [T] we highlight the main features and differences 
between generations. 



A. Hardware specifications 

On a GPU device one finds a number of Scalar Multiprocessors (SMs). Each multiproces- 
sor contains 8/32 (architecture dependent) Scalar Processor cores (SPs), a Multi-threaded 
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Instruction Unit (MTIU), special function units for transcendental numbers and the execu- 
tion of transcendental instructions such as sine, cosine, reciprocal, and square root, 32-bit 
registers and the shared memory space. In general, unlike a CPU, a GPU is specialized 
for compute-intensive, highly parallel computation - exactly the task graphics rendering re- 
quires - and is therefore designed such that more transistors are devoted to data processing 
rather than data caching and flow control. This is schematically illustrated in figure [TJ and 
makes GPUs less general-purpose but highly effective for data-parallel computation with 
high arithmetic intensity. Specifically, they are optimized for computations where the same 
instructions are executed on different data elements (often called Single Instruction Multi- 
ple Data, or SIMD) and where the ratio of arithmetic operations to memory operations is 
high. This puts a heavy restriction on the types of computations that optimally utilize the 
GPU, but in cases where the architecture is suitable to the task at hand it speeds up the 
calculations significantly. 



Device 

Multiprocessor N (SM) 



Multiprocessor 2 (SM) 
Multiprocessor! (SM) 




GPU 



(a) (b) 

Figure 1: (a) A schematic delineation that illustrates the ratio of transistors devoted to data pro- 
cessing (ALUs) to those of data caching and flow control in CPUs and GPUs. This is a manifestation 
of the different philosophies between the "all purpose" CPU to the "dedicated purpose" GPU. (b) 
A visual hardware model that describes the memory architecture and hierarchy of a GPU device. 
(Courtesy of NVIDIA). 
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B. Memory architecture 



NVIDIA's GPU memory model is highly hierarchical and is divided into several layers: 

Registers: The fastest form of memory on the GPU. Usually automatic variables declared 
in a kernel reside in registers, which provide very fast access. 

Local Memory: Local memory is a memory abstraction that implies "local" in the scope of 
each thread. It is not an actual hardware component of the SM. In fact, local memory 
resides in device memory allocated by the compiler and delivers the same performance 
as any other global memory region. 

Shared memory: Can be as fast as a register when there are no bank conflicts or when 
reading from the same address. It is located on the SM and is accessible by any thread 
of the block from which it was created. It has the lifetime of the block (which will be 
defined in the next subsection). 

Global memory: Potentially about 400 — 600 times slower than register or shared memory. 
This module is accessible by all threads and blocks. Data transfers from and to the 
GPU are done from global memory. 

Constant memory: Constant memory is read only from kernels and is hardware optimized 
for the case when all threads read the same location(i.e. it is cached). If threads read 
from multiple locations, the accesses are serialized. 

Texture memory: Graphics processors provide texture memory to accelerate frequently 
performed operations. The texture memory space is cached. The texture cache is op- 
timized for 2D spatial locality, so threads of the same warp that read texture addresses 
that are close together will achieve best performance. Constant and texture memory 
reside in device memory. 

C. CUDA programing model 

As has been mentioned, CUDA includes a software environment that allows developers 
to develop applications mostly in the C programming language. C for CUDA extends C 
by allowing the programmer to define C functions (kernels) that, when called, are executed 
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N times in parallel by iV different CUDA threads on the GPU scalar processors. Before 
invoking a kernel, the programmer needs to define the number of threads (N) to be cre- 
ated. Moreover the programmer needs to decide into how many blocks these threads will be 
divided. When the kernel is invoked, blocks are distributed evenly to the different multipro- 
cessors (hereby, having p multiprocessors, requires a minimum of p blocks to achieve 100% 
utilization of the GPU). The blocks might be 1, 2 or 3 dimensional with up to 512 threads 
per block. Blocks are organized into a 1 or 2 dimensional grid containing up to 65, 535 
blocks in each dimension. Each of the threads within a block that execute a kernel is given 
a unique thread ID. To differentiate between two threads from two different blocks, each of 
the blocks is given a unique ID as well. Threads within a block can cooperate among them- 
selves by sharing data through shared memory. Synchronizing thread execution is possible 
within a block, but different blocks execute independently. Threads belonging to different 
blocks can execute on different multiprocessors and must exchange data through the global 
memory. There is no efficient way to synchronize block execution, that is, in which order 
or on which multiprocessor they will be processed, thus, an efficient algorithm should avoid 
communication between blocks as much as possible. 

Much of the challenge in developing an efficient GPU algorithm involves determining an 
efficient mapping between computational tasks and the grid/block/thread hierarchy. The 
multiprocessor maps each thread to one scalar processor, and each thread executes indepen- 
dently with its own instruction address and register state. The multiprocessor SIMT (Single 
Instruction Multiple Threads) unit creates, manages, schedules, and executes threads in 
groups of 32 called warps. When a multiprocessor is given one or more thread blocks to exe- 
cute, it splits them into warps that get scheduled by the SIMT unit. The SIMT unit selects 
a warp that is ready to execute and issues the next instruction to the active threads of the 
warp. A warp executes one common instruction at a time, so full efficiency is realized when 
all threads of a warp agree on their execution path (on the G80 and GT200 architectures, 
it takes 4 clock cycles for a warp to execute, while on the Fermi architecture it takes only 
one clock cycle). If threads of a warp diverge via a data dependent conditional branch, the 
warp serially executes each branch path taken, disabling threads that are not on that path. 
When all paths terminate, the threads converge back to the same execution path. Branch 
divergence occurs only within a warp: different warps execute independently regardless of 
whether they are executing common or disjointed code paths. More information can be 
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found in RefR 





G80 


GT200 


Fermi 


Maximum number of multiprocessors 


; 16 


30 


16 


Scalar cores per multiprocessor 


8 


8 


32 


Double precision capability 


none 30 


| FMA ops/clock 


256 FMA ops/clock 


Special function units 


2 


2 


4 


Warp schedulers per multiprocessor 


1 


1 


2 


Shared memory 


16 KB 


16 KB 


48 KB 


Concurrent kernels 


1 


1 


Up to 16 


Table I: Summary of 


NVIDIA's 


different architect 


ures 



III. LATTICE SPIN MODELS 

We consider a general lattice model of spins that are placed on a square lattice of di- 
mensions d = 1,2,3.... The spins may be of any dimension and may acquire continuous 
or discrete values. In the applications reported below, we focus, for simplicity, on the case 
where the spins at lattice site i take discrete values of Sj = ±1, but this can be easily ex- 
tended to any spin dimension and value. The interactions between the spins is given by the 
Hamiltonian: 

H = J ij S i S j + ^2 B i s i- (!) 

In the above equation, the first sum is usually carried over nearest neighbors only (which 
we will denote (ij)). Jij is the interaction parameter and may be constant, discrete or 
continuous and Bi is an external field. The above Hamiltonian can be used to study equi- 
librium properties as in the Ising model and spin glass models (Edwards- Anderson modeP^, 
Sherrington-Kirkpatrick modeP^, random orthogonal modeP^etc), or the dynamic behav- 
ior as in facilitated spin models (Fredrickson and Andersen^, Jackle-Eisinger north-east 
modeP^', etc.). Simulations of such systems are based on Monte Carlo techniques^. In 
most CPU implementations, the algorithms for equilibrium or dynamic simulations do not 
differ significantly. However, as will become clear below, they become very different when 
implemented on a GPU. Though the implementations described below are for two simple 
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cases, the extension of our approach to the spin glass models mentioned above or to other 
facilitated spin models is straightforward. 

A. Monte Carlo simulation of the 2D Ising model 

The simplest 2D Ising modeP^ describes^ magnetic dipoles (or spins) placed on a 2D 
square lattice with one spin per cell. We limit the discussion to the spin | case, where each 
spin has only two possible orientations, "up"and "down". Each spin interacts with its nearest 
neighbors only, with a fixed interaction strength. In the absence of an external magnetic 
field, the Hamiltonian is given by 

H = -J'£ i 8 i 8 j , (2) 

m 

where (ij) represents a sum over nearest neighbors and J determines the energy scale. 
Monte Carlo simulation techniques based on the Metropolis algorithm 3 ^ are perhaps the 
most popular route to obtain the thermodynamic properties of this model. In a CPU 
implementation of the Metropolis algorithm, a spin is selected at random and an attempt 
to flip the spin is accepted with the Metropolis probability P acc = min[l, exp(— AH/T)], 
where T is the temperature in units of energy. In the GPU algorithm developed here, we 
will take advantage of the fact that spins interact only with their nearest neighbors, and 
thus the problem can be divided into non-interacting domains. The generalization to the 
case of finite interacting regions is straightforward. The algorithm is as follows: 

1. Randomly initialize the lattice (this is done on the CPU). 

2. Copy lattice to the GPU. 

3. Divide the lattice into Q sub-lattices, each with P spins. 

4. A grid of Q thread-blocks is formed. Every thread-block contains j threads. 

5. Every block copies a sub-lattice including its boundaries from the global memory to 
the shared memory. To avoid bank conflicts and save precious shared memory space, 
the short data type was used to form the lattice (figure [2]). 
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6. Within the block, every thread is in charge of 4 spin sites (a sub block of 2 x 2). At 



first all red spins are updated (figure 3(a)), i.e. all threads are active. Once a thread 



finished updating the red spin it continues to update its blue spin (figure 3(b) ). Since 
no native block synchronization exists, before updating the remaining spins, we make 
sure that all blocks finished the first two steps. This is done by recopying the data 
from the shared memory (boundaries excluded) on to the global memory and ending 
the kernel. 

7. Relaunch the kernel with the configuration generated in the previous step. Redo steps 



5 and 6, only this time the green and white spins are being updated (figure 3(c) ). 

8. Recopy data to global memory. 

This completes one Monte Carlo step (or one lattice sweep). In our implementation we chose 
sub-lattices of 32 x 32 in size and blocks of 256 threads, as this choice turned out to be the 
most efficient. Using more threads does in fact reduce the time it takes the block to copy 
a sub-lattice to the shared memory, but then the ratio of arithmetic operations to memory 
operations is low and performance is poor. To obtain thermodynamic average properties, 
we use the fact the CPU and GPU can work in parallel and the lattice is copied from the 
device to the host from time to time, so averages can be calculated on the CPU while the 
GPU continues to sweep the lattice. We note in passing that similar algorithms have been 
proposed by Tobias et alW^ and Blocked aL^The former approach is restricted to lattices 
with up to 1024 x 1024 spins in 2D (assuming spins are stored as integer data type), 
whereas the algorithm presented in this work is applicable to larger systems, is designed for 
coalesced global memory access and avoids bank conflicts, all of which makes better use of 
the GPU architecture. The limit of system size in our approach is related to the size of 
the global memory on the GPU, which is typically on the order of 1 — 4GB. This amounts 
to maximum system sizes of 20, 000 x 20, 000 - 40, 000 x 40, 000 spins. The algorithm of 
Block et alW^ is applicable to much larger systems and deals with the issue of multi-GPU 
programming, but is currently restricted to spin | systems while the algorithm presented 
in this work is suitable for the more generalized Potts modeP^l and gives approximately the 
same speedups in comparison to an equivalent CPU code. 
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Figure 2: Illustration of the algorithm for an Ising model implemented on a GPU device. Every 
block handles a sub lattice. Every thread is assigned 4 spin sites according to its ID and block. In 
order to improve memory access each block copies its sub-lattice on to shared memory. Since a spin 
needs its 4 nearest neighbors to update, extra boundary sites are copied as well (blueish square). 




(a) (b) (c) 

Figure 3: As native block synchronization is not available on a GPU, one must end the kernel in 
order to achieve it. This is only necessary after the red and blue spin where updated (as can be 
clearly seen, they do not interact) and before moving on to the green and white spins, which in 
turn do interact with the aforementioned updated red and blue spins. 

To acquire the correct thermodynamic equilibrium state, the chosen set of Monte Carlo 
moves must satisfy either detailed balance or the weaker balance condition. On the CPU 
detailed balance is rigorously satisfied when randomly selecting a spin within the Metropolis 
algorithm. For the GPU, however, the approach we developed breaks detailed balance and 
only balance is satisfiecP^ZI. This is a sufficient condition to ensure that the algorithm 
describes the correct Boltzmann distribution. 
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B. The North-East model 



The North-East model is based on the Ising model Hamiltonian with special constraints 
that are used to model facilitated dynamic d 23 * 31 * 38 l The Hamiltonian is given by: 

H = jJ2 s i s i+ B J2 s h (3) 

(hi) i 

where Sj, J, B and (ij) are described above. What makes this model different from the 
previous one is a constraint imposed on the transition probability to flip a spin. This 
probably is zero unless the spin's upper (north) and right (east) neighbors point "up". In 
the latter case, one accepts a flip with the same Metropolis probability given by P acc = 
min[l, exp(— AH/T)]. Thus, the thermodynamics of the model are the same as the Ising 
model, but the Monte Carlo dynamics generated by the above rule is quite different. In most 
applications reported in the literature one takes J = 0. For J ^ the dynamics generated 
by this model are richer and show an interesting re-entrant transition^. When J = the 
thermodynamics are trivial and the coupling between neighboring spins depends only on 
the aforementioned dynamical constraint. The lattice is initialized so spins point "up" with 
the probability c, which is also the equilibrium density of spins pointing "up", and can be 
expressed as c = - z e~ x l T ^ where the partition function is z = e 1//T + e _1//T and T is the 
unit-less temperature. Due to the dynamical constraints, if c is too low, one finds domains 
of spins that are stuck. For a high values of c on the other hand, all spins are flippable. As a 
consequence, there is a critical concentration c* below which the system is not ergodic. This 
transition from ergodic to non-ergodic behavior is modeled by the spin-spin autocorrelation 
function 

v 1 - ^ 

where (si) = 2c— 1 is the average spin polarization and Si(t) is the spin polarization at Monte 
Carlo step t for site i. We expect the function to decay to zero for an initial concentration 
c > c* and to decay to a finite value / (which is the fraction of spins that are stuck) for 
c < c*. The CPU implementation of the North-East model is identical to that described for 
the Ising model, with the additional constraint for the flipping probability. 
The GPU implementation for this model, similar as it may seem to the Ising model, is a 
bit cannier. We found out that applying the checkerboard algorithm (described in subsec- 
tion 



III A ) does not yield the same relaxation dynamics as the serial CPU implementation. 

II 



This is understandable, since this model imitates a diffusion process: for a spin to be able to 
change its configuration it is necessary that its north and east neighbors point up. If they 
do not, they in turn will also need their neighbors to point up to be able and change their 
configuration. Equilibration takes place by an up spin diffusion from "north-east" to "south- 
west". By sequentially (instead of randomly) sweeping the lattice we change the dynamics 
of this process. Such being the case, the corrected algorithm we developed is: 

1. Initialize the lattice so spins point up with probability c (on the CPU). 

2. Copy lattice to the GPU. 

3. Divide the lattice into Q sub-lattices, each with P spins. 

4. A grid of Q thread-blocks is formed. Every thread-block contains j threads. 

5. Blocks then randomly pick a sub-lattice (figure [4]) in such a way that two different 
blocks cannot pick the same sub-lattice. Every block copies a sub-lattice including its 
boundaries from the global memory to the shared memory. 

6. Within each block, A threads concurrently pick A spins from the sub-lattice randomly 
and update them. A is chosen to be a small fraction of P. 

7. Synchronize the block. 

8. Reiterate steps 6 — 7 q times, such that q ■ A = 

9. Copy back data from shared memory to global memory (to allow block synchroniza- 
tion). 

10. Relaunch the kernel with the same configuration and redo steps 5 — 8. 

11. Recopy data to global memory. 

This completes one lattice sweep. Again we use the fact the CPU and GPU can work in 
parallel, and the lattice is copied from the device to the host from time to time to store the 
spin's configuration for the computation of the autocorrelation function. 
The algorithm presented here does not preserve detailed balance nor the weaker balance 
condition, but since we are interested in its dynamics the given algorithm is correct. One 
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should note that in order to obtain the correct dynamics an initialization of the parameter 
A is needed. It is obvious that in the limit where Q — A = 1 the algorithm is in fact serial 
and preserves detailed balance (results for this limit are given by the black line in figure [5]). 
We can now use this result as a reference, and increase the number of threads and blocks 
that are working in parallel. There is an upper limit above which the results will diverge 
from the desired reference and the dynamics will no longer be correctly reproduced. Once 
A has been evaluated, the simulation can be performed. In figure [5] we provide a consistent 
test on the value of A for the North-East model at c = 0.42. 




Figure 4: Although the figure above looks similar to the Ising model implementation, in this case 
Sub-lattices and spins are chosen randomly. This preserves the dynamical behavior in comparison 
to CPU algorithms. Note that when copying the data to the shared memory, only the upper and 
right boundaries are necessary. 

It is possible to write an algorithm that maintains the weaker balance condition and at 
the same time preserve the dynamical relaxation of the model. The drawback of such an 
algorithm, however, is that it is 2.5x slower then the one presented above. The only 
changes are in steps 6 — 9: 

6. Within each block, A threads concurrently pick A red spins (figure [4]) from the sub- 
lattice randomly and update them. A is chosen to be a small fraction of P. 

7. Synchronize the block and pick A white spins from the sub-lattice randomly and update 
them. 
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Figure 5: Convergence tests of the parameter A for the North-East model at c = 0.42. The solid 
black line represents the reference value, which was obtained by running the simulation on a single 
thread - single block. This result is identical to the serial CPU result. The best performence 
is achieved for sub-lattices of 32 x 32 spins and for A = 16 threads, i.e., attempts to flip only 



l.E 



V ~ 16 
/0 ~ 32x32 



of the spins inside a sub-lattice. One can see that for higher values of A, the 



correlation diverges from the correct result. 



8. Copy back data from shared memory to global memory (to allow block synchroniza- 
tion). 

9. Relaunch the kernel with the same configuration and redo steps 6 — 8 three more times 
(this will complete one MC step). 



IV. PSEUDO RANDOM NUMBERS GENERATION 

Monte Carlo simulations rely heavily on the availability of random or pseudo-random 
numbers. In this work, random numbers were used to determine whether spin flips are 
accepted or rejected, in accordance with the Metropolis algorithm. As is widely known, the 
use of a poor quality PRNG (Pseudo Random Numbers Generator) may lead to inaccurate 
simulations.^ In the course of this work three different PRNGs were utilized: 

1. L'Ecuyer with Bays-Durham shuffle and added safeguards. 41 This PRNG has a long 
period (2 x 10 18 ) and is easy to implement on a CPU. Unfortunately, porting it to 
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GPUs causes a dramatic decrease in performance: this is mostly due to the fact that 
the implementation of this algorithm requires too many registers. 

2. The "minimal" random number generator of Park and Miller This PRNG has a 
period of 2 x 10 9 and is portable to GPUs. In practice, however, it proved to work 
poorly and the GPU simulation results were in poor agreement with reference values 
obtained with better PRNGs. 

3. Linear Congruential Random Number Generator (LCRNG) This PRNG has a pe- 
riod of 10 6 — 10 9 and provided good results even for very long runs. The LCRNG 
algorithm is very easy to port to the GPU and has the advantage of being very fast, 
requiring only a few operations per call. 

In our implementations (section , each thread used a separate LCRNG, creating its 
own sequence of pseudo-random numbers with a unique seed. The sequence (of thread i) is 
created as follows: 

xj+i = {a ■ xj hl + c) mod m (5) 

An appropriate choice of the coefficients is responsible for the quality of the LCRNG. We 
usecPa = 1,664,525 , c = 1,013,904,223 and m = 2 32 . The different seeds were created 
per thread according to 

x Th i+1 = ( 16g07 . x Th^ mod m ^ 
with xl ho = 1. Similar PRNGs were used for other GPU applications . 1 11 

V. RESULTS 

For comparison, CPU codes were executed on a PC with Intel Core2 Duo E7400 @ 2.8GHz 
processor, Intel RaisinCity motherboard with Intel G41 chipset and Kingstone 2GB, 800MHz 
RAM (only a single core was used for the calculations). The operating system was CEN- 
TOS. Codes were compiled with Intel C++ compiler (ICC) using all optimizations provided 
for best performance. Speedups reported here are given in terms of the complete applica- 
tion running time (from initialization till results are processed), rather than Gflops or spin 



15 



updates per second. This choice is important, as it most closely describes the "real" gain in 
practical simulations by the use of a GPU rather than a CPU. 



A. Ising model 

In order to verify the GPU implementation, we compared values of magnetization, energy 
and heat capacity as a function of temperature (temperature was taken in energy units) 
between GPU and CPU versions. In figure [6] results from a 128 x 128 spin lattice are 
presented. We find the results to clearly agree. In terms of acceleration we achieved a 
15 x factor for lattices with 512 x 512 spins on the GT 9600 GPU (G80 architecture). 
Implementing the same code on the new GTX 480 GPU, we achieved a factor of 150 x . This 
factor reduces to 2x for small lattices (32 x 32). The reason for this is that in small problems 
it becomes harder to hide memory access latencies, because there are not enough threads 
to execute between memory access operations. A theoretical analysis of our implementation 
shows that the GPU reaches full occupancy (a useful tool to check this is the "CUDA GPU 
Occupancy Calculator" which can be freely downloaded from Ref.^). The 10 fold factor 
obtained by the GTX 480 in comparison with the GT 9600 is easily understood when taking 
into consideration that the GTX 480 hasl5 SMs instead of 8 (on the GT 9600), 4 blocks 
can be active simultaneously instead of 3 and a warp executes in one clock cycle instead 
of 4. Thus the relative speedup is ^ • » ■ I = 10. From figure 8(a) it is obvious that the 



GPU reaches full occupancy only for lattices bigger than 128 x 128. Note also that near 
the critical temperature, the fluctuations and noise increase. Since the present work is not 
concerned with determining the critical behavior, we have used the same number of Monte 
Carlo sweeps for all temperatures. Estimation of the critical behavior requires much longer 
runs, and perhaps also larger systems. In this respect, the GPU approach developed here 
provides the means to increase numerical accuracy by more than one order of magnitude 
(noise scales with the square root of the number of MC steps), either by increasing the 
system size or by simulating longer Monte Carlo runs. 
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Figure 6: Comparison between the Monte Carlo simulation done on a CPU and on a GPU for a 
128 x 128 spin lattice. The temperature was stepwise reduced by 0.01 from T = 2.4 to T = 2.1. 
The critical temperature for this model is T c = 2/log(l + \/2). At each temperature, 2 x 10 7 
sweeps through the lattice were performed, during which 10, 000 different measurement were taken 
(after reaching equilibrium). In (a) we show the average spin magnetization, in (b) the average spin 
energy and in (c) the heat capacity. 



B. North-East model 

The North-East model has a critical concentration c* ~ 0.3, below which the dynamics 
break ergodicity. In figure [7] we show the results for spin-spin autocorrelation function for 
different concentrations above the critical value. Two lattice sizes were studied. Similar to 
the previous case, we find that the GPU results agree well with the CPU results, indicating 
that the proposed algorithm reproduces the correct dynamics. This is not trivial and depends 
on the value chosen for A. Moreover, as pointed out above, the algorithm used for the Ising 
model fails to produce correct relaxation times. As the system approached the critical 
concentration the dynamics become sluggish and the autocorrelation function decays slowly 
to zero. 
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Figure 7: Comparison between spin-spin autocorrelation functions calculated from simulations 
done on a CPU and on a GPU for spin lattices of different sizes and different initial concentrations. 
Results were obtained by averaging the autocorrelation functions of 30 different initial realization for 
every concentration, (a) Results for a 128 x 128 spin lattice. As can be seen for low concentrations 
approaching c*, relaxation time is very long (longer than the time we were willing to wait for the 
CPU results). Furthermore, at this concentration, more than 30 trajectory are required to average 
the autocorrelation function, (b) and (c) Results obtained from 512 x 512 spin lattices. 



In figure 8(b)| we compare the running times between the CPU and two GPU architectures. 
We achieved a 2x factor on the GT 9600 GPU and a ~ 70 x speedup running the same 
code on the new GTX 480 GPU. This factor reduces to 8x for smaller lattices (128 x 128 
and less). The GPU acceleration is nearly two orders of magnitude, implying that one can 
simulate the system closer to the critical density. This is important, since the behavior of 
relaxation near the critical density is not necessarily universal, and thus extrapolations are 
often tricky. 



VI. CONCLUSIONS AND SUMMARY 



We have developed algorithms to simulate lattice spin models on consumer-grade GPUs. 
Two prototype models have been considered: The Ising model describing critical phenomena 
at equilibrium and the North-East model describing glassy dynamics. We showed that for 
equilibrium properties of lattice models an impressive speedup of 150 x can be achieved. 
To simulate the dynamics of such models, a more sophisticated approach was developed in 
order to preserve the dynamical rules and outcome. Our algorithm for the dynamic model 
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Figure 8: (a) Processing times for the 2D Ising model. Times are shown as a function of the number 
of spins n per row which is related to the system size by N = n 2 . A maximum acceleration factor 
of 15 was achieved (on the GT 9600GPU). We believe a higher value can be obtained for larger 
lattices, (b) Speedups of two different GPU cards versus the CPU serial implementation for the 
North-East model. Again times are shown as a function of the number of spins n per row. 

reaches a ~ 70 x factor in comparison to the serial CPU implementation for large system 
sizes. Though the algorithms were performed on specific models, we feel they can be easily 
extended to a larger class of similar systems. 

Since the gain in computational power embodied by these results is in some cases two orders 
of magnitude, while the required GPU hardware is priced similarly to a CPU and can often 
be added to existing systems, the algorithms are certainly of interest. On the other hand, 
taking full advantage of it still requires savvy knowledge of the device and its capabilities and 
limitations, and the development of specific numerical algorithms. As the advantages of this 
useful technology become clear and knowledge about its implementation continues to build 
up and spread, we hope it will become more accessible to a wide variety of computationally- 
minded scientists. 
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